Introduction to Open Data Science - Course Project

About the project

This is my course diary for the Introduction to Open Data Science course. It is written in R Markdown (.Rmd), and links to the GitHub repository and the GitHub Pages site are at the end of this section.

Where did I find the course?

I picked this course from the University of Helsinki course catalog.


Feelings about the course

I am really excited to learn R, because I want new tools for processing data. I usually use Excel, but as a software engineer I would rather put my existing programming skills to full use.

The first examples were really basic, but I still found them interesting: the whole point of learning this kind of language is to get things done quickly.

Github: https://github.com/plipplopplipplop/IODS-project

Github Pages: https://plipplopplipplop.github.io/IODS-project


Chapter 2: Regression and model validation



0. Pre-page code

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

rd <- read.table("data/learning2014.txt",  # forward slashes work on every platform
                 as.is = TRUE,
                 sep = "\t",
                 header = TRUE)

1. Data set explanation

str(rd)
## 'data.frame':    183 obs. of  71 variables:
##  $ Aa      : int  3 2 4 4 3 4 4 3 2 3 ...
##  $ Ab      : int  1 2 1 2 2 2 1 1 1 2 ...
##  $ Ac      : int  2 2 1 3 2 1 2 2 2 1 ...
##  $ Ad      : int  1 2 1 2 1 1 2 1 1 1 ...
##  $ Ae      : int  1 1 1 1 2 1 1 1 1 1 ...
##  $ Af      : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ ST01    : int  4 4 3 3 4 4 5 4 4 4 ...
##  $ SU02    : int  2 2 1 3 2 3 2 2 1 2 ...
##  $ D03     : int  4 4 4 4 5 5 4 4 5 4 ...
##  $ ST04    : int  4 4 4 4 3 4 2 5 5 4 ...
##  $ SU05    : int  2 4 2 3 4 3 2 4 2 4 ...
##  $ D06     : int  4 2 3 4 4 5 3 3 4 4 ...
##  $ D07     : int  4 3 4 4 4 5 4 4 5 4 ...
##  $ SU08    : int  3 4 1 2 3 4 4 2 4 2 ...
##  $ ST09    : int  3 4 3 3 4 4 2 4 4 4 ...
##  $ SU10    : int  2 1 1 1 2 1 1 2 1 2 ...
##  $ D11     : int  3 4 4 3 4 5 5 3 4 4 ...
##  $ ST12    : int  3 1 4 3 2 3 2 4 4 4 ...
##  $ SU13    : int  3 3 2 2 3 1 1 2 1 2 ...
##  $ D14     : int  4 2 4 4 4 5 5 4 4 4 ...
##  $ D15     : int  3 3 2 3 3 4 2 2 3 4 ...
##  $ SU16    : int  2 4 3 2 3 2 3 3 4 4 ...
##  $ ST17    : int  3 4 3 3 4 3 4 3 4 4 ...
##  $ SU18    : int  2 2 1 1 1 2 1 2 1 2 ...
##  $ D19     : int  4 3 4 3 4 4 4 4 5 4 ...
##  $ ST20    : int  2 1 3 3 3 3 1 4 4 2 ...
##  $ SU21    : int  3 2 2 3 2 4 1 3 2 4 ...
##  $ D22     : int  3 2 4 3 3 5 4 2 4 4 ...
##  $ D23     : int  2 3 3 3 3 4 3 2 4 4 ...
##  $ SU24    : int  2 4 3 2 4 2 2 4 2 4 ...
##  $ ST25    : int  4 2 4 3 4 4 1 4 4 4 ...
##  $ SU26    : int  4 4 4 2 3 2 1 4 4 4 ...
##  $ D27     : int  4 2 3 3 3 5 4 4 5 4 ...
##  $ ST28    : int  4 2 5 3 5 4 1 4 5 2 ...
##  $ SU29    : int  3 3 2 3 3 2 1 2 1 2 ...
##  $ D30     : int  4 3 4 4 3 5 4 3 4 4 ...
##  $ D31     : int  4 4 3 4 4 5 4 4 5 4 ...
##  $ SU32    : int  3 5 5 3 4 3 4 4 3 4 ...
##  $ Ca      : int  2 4 3 3 2 3 4 2 3 2 ...
##  $ Cb      : int  4 4 5 4 4 5 5 4 5 4 ...
##  $ Cc      : int  3 4 4 4 4 4 4 4 4 4 ...
##  $ Cd      : int  4 5 4 4 3 4 4 5 5 5 ...
##  $ Ce      : int  3 5 3 3 3 3 4 3 3 4 ...
##  $ Cf      : int  2 3 4 4 3 4 5 3 3 4 ...
##  $ Cg      : int  3 2 4 4 4 5 5 3 5 4 ...
##  $ Ch      : int  4 4 2 3 4 4 3 3 5 4 ...
##  $ Da      : int  3 4 1 2 3 3 2 2 4 1 ...
##  $ Db      : int  4 3 4 4 4 5 4 4 2 4 ...
##  $ Dc      : int  4 3 4 5 4 4 4 4 4 4 ...
##  $ Dd      : int  5 4 1 2 4 4 5 3 5 2 ...
##  $ De      : int  4 3 4 5 4 4 5 4 4 2 ...
##  $ Df      : int  2 2 1 1 2 3 1 1 4 1 ...
##  $ Dg      : int  4 3 3 5 5 4 4 4 5 1 ...
##  $ Dh      : int  3 3 1 4 5 3 4 1 4 1 ...
##  $ Di      : int  4 2 1 2 3 3 2 1 4 1 ...
##  $ Dj      : int  4 4 5 5 3 5 4 5 2 4 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ d_sm    : int  15 13 15 13 16 19 17 15 19 16 ...
##  $ d_ri    : int  15 10 16 15 14 20 17 13 17 16 ...
##  $ d_ue    : int  13 12 11 14 14 18 12 11 16 16 ...
##  $ deep    : int  43 35 42 42 44 57 46 39 52 48 ...
##  $ st_os   : int  14 14 13 12 16 15 12 15 16 16 ...
##  $ st_tm   : int  13 8 16 13 13 14 6 17 18 12 ...
##  $ stra    : int  27 22 29 25 29 29 18 32 34 28 ...
##  $ su_lp   : int  10 9 7 7 8 8 5 10 7 10 ...
##  $ su_um   : int  11 12 8 11 12 10 5 11 6 12 ...
##  $ su_sb   : int  10 17 12 9 14 11 13 13 13 14 ...
##  $ surf    : int  31 38 27 27 34 29 23 34 26 36 ...

The data set was collected as part of an international research project on teaching and learning. A total of 183 participants answered the questionnaire, which measured how attendees study during the course: a deep approach, where they study hard and focus on understanding the details; a strategic approach, where they plan their studies and work evenly through the course; or a surface approach, where they do just enough to get the course done and move on to the next topic.

There are 32 questions (marked STxx, SUxx, and Dxx in the data set) covering three internal dimensions (deep, surface, and strategic). Each set of questions is first aggregated into sub-dimension collections that capture the idea behind each approach (d_xx, st_xx, and su_xx), and these collections are then summed into the overall score of each approach (the variables deep, stra, and surf).
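
If the sums were formed this way, the relationship can be checked directly. A quick sketch, assuming the column structure shown by str() above:

```r
# Each overall approach score should equal the sum of its collections:
all(rd$deep == rd$d_sm + rd$d_ri + rd$d_ue)     # deep approach
all(rd$stra == rd$st_os + rd$st_tm)             # strategic approach
all(rd$surf == rd$su_lp + rd$su_um + rd$su_sb)  # surface approach
```

Each line should return TRUE if the collections were combined by summation as described.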

The questionnaire also contained auxiliary questions, stored as Aa-Af, Ca-Ch, and Da-Dj. Basic background data are included as well (age, gender, attitude, and total points).


2. Graphical overview

myplot <- ggpairs(rd,
                  columns = c(57, 58, 59, 64, 67, 71),
                  mapping = ggplot2::aes(col = gender, alpha = 0.3),
                  lower = list(combo = wrap("facethist", bins = 20)))

myplot

The charts are colour coded by gender (red being female). From them the reader can see the distribution of questionnaire points by the attendee's gender, age, and attitude towards statistics.

The students were quite young, and attitude was slightly better among males. Overall points were quite good, and the deep approach to studying was the most common one. The strategic approach was used in combination with the deep approach, but the surface approach was not that common.


3. Regression model

diagplots <- lm(points ~ surf + stra + deep, rd)
summary(diagplots)
## 
## Call:
## lm(formula = points ~ surf + stra + deep, data = rd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.222  -2.516   1.750   5.531  13.291 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.12137    6.81477   3.980  0.00010 ***
## surf        -0.17294    0.10303  -1.678  0.09500 .  
## stra         0.28027    0.10246   2.735  0.00686 ** 
## deep        -0.17222    0.09888  -1.742  0.08326 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.46 on 179 degrees of freedom
## Multiple R-squared:  0.06432,    Adjusted R-squared:  0.04864 
## F-statistic: 4.101 on 3 and 179 DF,  p-value: 0.007617

I chose three explanatory variables: deep, surf, and stra.

I chose these variables because they are the aggregate scores of the question sets, and each of them may have a relationship with the total points of the questionnaire.

Residuals are the differences between the observed response values and the values the model predicts. Here they show how far individual point totals fall from the fitted line.

Estimate is the fitted coefficient: the expected change in points for a one-unit increase in the explanatory variable (the intercept being the baseline). Std. Error measures the precision of that estimate. The t value is the estimate divided by its standard error, i.e. how many standard errors the coefficient is from zero. Finally, Pr(>|t|) is the probability of observing a t value at least this extreme if the true coefficient were zero.
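
These quantities are linked: the t value is simply the estimate divided by its standard error. For example, for stra:

```r
# t value = Estimate / Std. Error; for stra this reproduces the 2.735 above
coefs <- summary(diagplots)$coefficients
coefs["stra", "Estimate"] / coefs["stra", "Std. Error"]
```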


4. Relationship between the chosen explanatory variables

Multiple R-squared is the proportion of the variance in points that the model explains, while adjusted R-squared penalizes that figure for the number of explanatory variables in the model. Here the model explains only about 6% of the variance, so the fit is actually quite weak.
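
Multiple R-squared can also be computed by hand from the residuals, which makes its meaning concrete:

```r
# R-squared = 1 - RSS/TSS: the share of variance in points the model explains
rss <- sum(residuals(diagplots)^2)
tss <- sum((rd$points - mean(rd$points))^2)
1 - rss / tss  # ~0.064, matching the Multiple R-squared in the summary
```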


5. Diagnostic plots

par(mfrow = c(2,2))
plot(diagplots, which = c(1, 2, 5))

The Residuals vs Fitted plot checks the assumptions of linearity and constant error variance: the points should scatter evenly around zero with no pattern. The Normal Q-Q plot checks whether the residuals are approximately normally distributed: the points should follow the diagonal line. The Residuals vs Leverage plot highlights influential observations that have an unusually large effect on the fitted model.




Chapter 3: Logistic Regression


date()
## [1] "Mon Dec 06 05:11:34 2021"
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(boot)

url <- "https://github.com/rsund/IODS-project/raw/master/data"
url_alc <- paste(url, "alc.csv", sep = "/")

alc <- read.csv(url_alc, header = TRUE)
dim(alc)
## [1] 370  51
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "n"          "id.p"       "id.m"      
## [31] "failures"   "paid"       "absences"   "G1"         "G2"        
## [36] "G3"         "failures.p" "paid.p"     "absences.p" "G1.p"      
## [41] "G2.p"       "G3.p"       "failures.m" "paid.m"     "absences.m"
## [46] "G1.m"       "G2.m"       "G3.m"       "alc_use"    "high_use"  
## [51] "cid"

This joined data set contains the alcohol consumption of students in two Portuguese schools. The data were collected using school reports and questionnaires. The set contains 370 student records; the 29 variables of interest are listed below.

  1. school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  2. sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
  3. age - student’s age (numeric: from 15 to 22)
  4. address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  5. famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  6. Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  7. Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  8. Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
  9. Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  10. Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  11. reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  12. guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. schoolsup - extra educational support (binary: yes or no)
  16. famsup - family educational support (binary: yes or no)
  17. activities - extra-curricular activities (binary: yes or no)
  18. nursery - attended nursery school (binary: yes or no)
  19. higher - wants to take higher education (binary: yes or no)
  20. internet - Internet access at home (binary: yes or no)
  21. romantic - with a romantic relationship (binary: yes or no)
  22. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  23. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  24. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  25. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  26. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  27. health - current health status (numeric: from 1 - very bad to 5 - very good)
  28. alc_use - The average of Dalc and Walc
  29. high_use - TRUE if alc_use is higher than 2 and FALSE otherwise
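
Per items 25-28 of the list above, the two combined variables can be reproduced directly. A sketch for checking (the *_check names are throwaway names of my own):

```r
# alc_use is the average of workday and weekend consumption (Dalc, Walc);
# high_use flags averages above 2
alc_use_check <- (alc$Dalc + alc$Walc) / 2
high_use_check <- alc_use_check > 2
all(alc_use_check == alc$alc_use)  # should be TRUE if defined as listed
```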

3. Personal hypothesis

I chose the following four variables for a personal hypothesis on alcohol consumption: 1. romantic 2. higher 3. internet 4. schoolsup

My hypothesis is that a romantic relationship and aiming for higher education decrease alcohol consumption, while lacking support from the school and having no internet connection at home (and therefore staying longer on campus) increase it.

4. Personal hypothesis on numbers

phPlot <- ggplot(alc, aes(x = romantic, y = alc_use, col = sex))
phPlot + geom_boxplot() +
          ylab("Average alcohol consumption") +
          xlab("With a romantic relationship")

As the box plot shows, alcohol consumption did not drop much when a person was in a romantic relationship. Female usage dropped slightly, while among the less-consuming males consumption slightly increased.

phPlot <- ggplot(alc, aes(x = higher, y = alc_use, col = sex))
phPlot + geom_boxplot() +
        ylab("Average alcohol consumption") +
        xlab("Wants to take higher education")

As assumed, wanting higher education is associated with lower alcohol consumption among males. Among females who want higher education, however, consumption is higher, which is the exact opposite of my assumption.

phPlot <- ggplot(alc, aes(x = internet, y = alc_use, col = sex))
phPlot + geom_boxplot() +
        ylab("Average alcohol consumption") +
        xlab("Internet access at home")

Here is a second surprising result: internet access at home increases the alcohol consumption of female students. Among males the median stayed the same, but the distribution became more dispersed at both the upper and lower ends.

phPlot1 <- ggplot(alc, aes(x = schoolsup, y = alc_use, col = sex))
phPlot1 + geom_boxplot() +
        ylab("Average alcohol consumption") +
        xlab("Extra educational support")

This was the only assumption that went as I thought: students who receive extra educational support consume less alcohol. The drop was clear in both genders, just as I assumed.

5. Logistic Regression

m <- glm(high_use ~ romantic + higher + internet + schoolsup, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ romantic + higher + internet + schoolsup, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2857  -0.8807  -0.7814   1.5065   1.7737  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.2511     0.5980   0.420    0.675  
## romanticyes   -0.2831     0.2547  -1.112    0.266  
## higheryes     -1.2242     0.5263  -2.326    0.020 *
## internetyes    0.2261     0.3289   0.688    0.492  
## schoolsupyes  -0.3105     0.3574  -0.869    0.385  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 444.76  on 365  degrees of freedom
## AIC: 454.76
## 
## Number of Fisher Scoring iterations: 4

The summary shows that higher is the only variable whose p-value is below 5%, so it is the only one that can be considered statistically significant. This confirms what the box plots suggested: the only clearly meaningful variable was indeed higher. The deviance residuals are skewed towards the negative side, reflecting that high alcohol use is the less common outcome among these 15-22-year-olds, although there is considerable spread between the minimum and maximum.

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR     2.5 %    97.5 %
## (Intercept)  1.2854900 0.3969561 4.2627835
## romanticyes  0.7534579 0.4527659 1.2317868
## higheryes    0.2939864 0.1007957 0.8226494
## internetyes  1.2537265 0.6707657 2.4542330
## schoolsupyes 0.7330743 0.3506197 1.4390072

The odds ratios tell a similar story: higher is the only variable whose effect clearly differs from no effect. Its odds ratio is about 0.29, meaning the odds of high alcohol use are roughly a third for students aiming at higher education, and its 95% confidence interval (0.10-0.82) is the only one that excludes 1. The other odds ratios are close to 1 and their confidence intervals all contain 1, so their effects cannot be distinguished from chance.

6. Explore the predictive power of your model

prob <- predict(m, type = "response")
alc <- mutate(alc, probability = prob)
alc <- mutate(alc, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252    7
##    TRUE    103    8
table(high_use = alc$high_use, prediction = alc$prediction) %>%
        prop.table() %>%
        addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.68108108 0.01891892 0.70000000
##    TRUE  0.27837838 0.02162162 0.30000000
##    Sum   0.95945946 0.04054054 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2972973

As the cross tabulation shows, the prediction is accurate when high_use is FALSE (252 of 259 correct) but very poor when it is TRUE (only 8 of 111 correct): the model predicts TRUE in only 15 cases overall. The total training error, given by the loss function, is about 30%.
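
The per-class accuracy can be read straight off the confusion matrix:

```r
# Share of correct predictions within each true class:
# FALSE: 252/259 ≈ 0.97, TRUE: 8/111 ≈ 0.07
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
diag(tab) / rowSums(tab)
```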

7. Bonus

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.3108108

This uses the same method as in DataCamp. The 10-fold cross-validation gives a prediction error of about 31%, slightly higher than the roughly 30% training error above. That is expected, since cross-validation estimates the error on data the model has not seen.



Chapter 4: Clustering and classification


date()
## [1] "Mon Dec 06 05:11:37 2021"
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 2: Explore the data

data("Boston")

dim(Boston)
## [1] 506  14
colnames(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

The Boston data frame is a sample dataset from the R package MASS. It has 506 rows and 14 columns, which are the following:

  • crim = per capita crime rate by town.
  • zn = proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus = proportion of non-retail business acres per town.
  • chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox = nitrogen oxides concentration (parts per 10 million).
  • rm = average number of rooms per dwelling.
  • age = proportion of owner-occupied units built prior to 1940.
  • dis = weighted mean of distances to five Boston employment centres.
  • rad = index of accessibility to radial highways.
  • tax = full-value property-tax rate per $10,000.
  • ptratio = pupil-teacher ratio by town.
  • black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat = lower status of the population (percent).
  • medv = median value of owner-occupied homes in $1000s.

Task 3: Graphical overview of the dataset

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
par(mfrow=c(7,2), fin=c(4,4), mar=c(2,5,2,1), las=1, bty="n")
plot(Boston$crim, ylab = "Per capita", main = "Crime Rate")

plot(Boston$zn, ylab = "lots over 25,000 sq.ft", main = "Land zoned for lots")

plot(Boston$indus, ylab = "acres per town", main = "Non-retail business acres")

plot(Boston$medv, ylab = "Value of homes", main = "Owner-occupied homes")

plot(Boston$nox, ylab = "parts per 10 million", main = "NOX concentration")

plot(Boston$rm, ylab = "number of rooms", main = "Rooms per dwelling")

plot(Boston$age, ylab = "built prior to 1940", main = "Owner-occupied units built")

plot(Boston$dis, ylab = "weighted mean", main = "Distances to employment centres")

plot(Boston$rad, ylab = "index", main = "Accessibility to highways")

plot(Boston$tax, ylab = "tax rate per 10,000", main = "Property-tax rate")

plot(Boston$ptratio, ylab = "ratio", main = "Pupil-teacher")

plot(Boston$black, ylab = "1000(Bk - 0.63)^2", main = "Proportion of blacks")

plot(Boston$lstat, ylab = "Percent", main = "Lower status")

Comparing the other variables to the crime rate: higher crime rates seem to occur in areas with a high property-tax rate and a low share of owner-occupied homes, i.e. the owners do not live in the same area. Higher NOX concentration (traffic pollution) and good accessibility to radial highways also appear to go together with higher crime rates, and the employment centres are far away from these areas. On the other hand, the crime rate shows no clear relationship with the black or rm variables.

Task 4: Standardize the dataset

# Make standardized dataset
boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)

# Create categorical variable
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# Drop the crime rate from old data and add new categorial
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Divide the dataset to train and test sets.
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

dim(test)
## [1] 102  14
dim(train)
## [1] 404  14

Comparing the two summaries shows what scaling does: each variable now has mean 0 and unit standard deviation. Variables that cannot be negative in the original data (such as crim) now take negative values, because observations below the column mean become negative after centring.
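
As a sanity check, scale() can be reproduced by hand for a single column (at this point Boston is still the original, unscaled frame):

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston$crim)))  # TRUE
```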

Task 5: Linear discriminant analysis on train set

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)  # full column name; crim itself was dropped earlier

plot(lda.fit, dimen = 2, col = classes, pch = classes, xlab = "LD1", ylab = "LD2")
lda.arrows(lda.fit, myscale = 2)

From this plot we can see that accessibility to radial highways (rad) separates the crime rate classes the most. Nitrogen oxide concentration (nox) also has a noted link to increasing crime rates. Conversely, when the proportion of residential land zones (zn) increases, the crime rates are lower.

Task 6: Predict the test dataset crime variable with LDA model

crimeCat <- test$crime
test <- dplyr::select(test, -crime)

lda.pred <- predict(lda.fit, newdata = test)
table(correct = crimeCat, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       11       9        2    0
##   med_low    6      12       14    0
##   med_high   2       4       14    1
##   high       0       0        0   27

The high crime rate predictions are very consistent (all 27 high cases were predicted correctly), whereas there is more confusion among the low and med_low classes.

Task 7: Standardize the original Boston dataset

data("Boston")
Boston <- scale(Boston)

km <-kmeans(Boston, centers = 4)
pairs(Boston, col = km$cluster)

I ended up with four clusters. With five clusters there was always one cluster that was either too small or too dispersed; with three the clusters were even but coarse. With four I get four reasonably even clusters whose centres match the dense regions visible in the pairs plot.
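
A common way to back up this choice is the elbow heuristic: compute the total within-cluster sum of squares (WCSS) for a range of k and look for the point where the curve bends. A sketch (the seed value is arbitrary, since kmeans starts from random centres):

```r
set.seed(123)  # arbitrary seed for reproducible random starts
twcss <- sapply(1:10, function(k) kmeans(Boston, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```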

Super Bonus: 3d Matrix

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# km$centers is a matrix of centre coordinates, not a per-point colour vector,
# so colour each train-set row by its k-means cluster instead:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster[ind])

Both plots show the same point cloud; only the colouring differs, the first being coloured by the crime classes of the train set and the second by the k-means clustering.



Chapter 5: Dimensionality reduction techniques


date()
## [1] "Mon Dec 06 05:11:46 2021"
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.2
library(tidyr)
library(ggplot2)
data("tea")
human <- read.csv(file = "data/human2.csv", header = TRUE)

Task 1: Graphical overview of the dataset

str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ X    : chr  "Norway" "Australia" "Switzerland" "Denmark" ...
##  $ edu2R: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labR : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ life : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ edu  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ gni  : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mmr  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ abr  : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parl : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

The HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone. The HDI can also be used to question national policy choices, asking how two countries with the same level of GNI per capita can end up with different human development outcomes.

This data set is a subset of the data used to calculate the HDI country by country. The human data set contains the following variables:

  • X: Header of each row; every row contains the data of a single country
  • life: Life expectancy index
  • gni: Gross National Income (commas removed from the original text values during wrangling)
  • labR: Female and male labour force participation rates
  • edu2R: Female and male population with at least secondary education
  • parl: Female and male shares of parliamentary seats
  • mmr: Maternal Mortality ratio
  • abr: Adolescent Birth Rate
  • edu: Expected years of schooling

Task 2: Perform principal component analysis

pca_human <- prcomp(human[-1])
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "GNI based PCA")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Task 3: Standardize the variables and repeat the above analysis

# Repeat the PCA after standardizing each variable (zero mean, unit variance)
pca_human <- prcomp(scale(human[-1]))
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 2)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], main = "Scaled PCA")

When comparing the standardized and non-standardized biplots, you can clearly see that the scaling is off in the latter. The gni values are very large because, during data wrangling, the comma was removed and the textual value was converted to numeric, which puts the values close to 100 000. Without scaling, gni dwarfs the other variables, so the biplot is dominated by it: in the non-standardized plot the only visible arrow is gni, and the other arrows are so short that they are not even drawn (hence the zero-length arrow warnings above).
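
This scaling effect is easy to reproduce with a small synthetic example (the data below is made up; only the scales mimic the real variables): one variable on a GNI-like scale of ~100 000 captures essentially all of the variance when PCA is run on unscaled data.

```r
set.seed(1)
# Synthetic data: one huge-scale variable, two small-scale ones
toy <- data.frame(gni  = rnorm(50, mean = 1e5, sd = 2e4),
                  life = rnorm(50, mean = 70,  sd = 5),
                  edu  = rnorm(50, mean = 12,  sd = 2))

unscaled <- prcomp(toy)
scaled   <- prcomp(toy, scale. = TRUE)

# Proportion of variance on PC1: essentially 100% unscaled,
# spread across the components once the variables are standardized
summary(unscaled)$importance[2, 1]
summary(scaled)$importance[2, 1]
```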

Task 4: Personal interpretations, two principal component dimensions

There is a correlation between the labour force participation of men and women and their shares of parliamentary seats. Both of these factors raise the life expectancy component of the Human Development Index. This makes sense, because they are correlated with gni, which is basically income: with more income there are better services and better chances of survival. However, this dimension accounts for only about 17% of the variance. Comparing data from multiple countries, in many of them women are both working and represented in parliament.

The rest of the measured variables load on the other dimension, which accounts for 53.6% of the variance. Maternal mortality ratio and adolescent birth rate point in the positive direction, while expected years of schooling, life expectancy and GNI point in the opposite direction: in countries that score high on this dimension, people die young and receive less education.
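
These interpretations come from the signs and sizes of the loadings, which can be read from the rotation matrix of the PCA object (for the course data, pca_human$rotation). As a self-contained illustration, here is the same idea with base R's built-in USArrests data:

```r
# PCA on standardized USArrests; the rotation matrix shows which
# variables pull in the same direction along each principal component
pca <- prcomp(USArrests, scale. = TRUE)

# Loadings of the first two principal components
round(pca$rotation[, 1:2], 2)
```

Variables with same-signed loadings on a component move together along it, which is exactly how the labour/parliament and mortality/education groupings above were read off the biplot.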

Task 5: Multiple Correspondence Analysis on the tea data

library(FactoMineR)
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
tidyr::gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

mca <- MCA(tea_time, graph = FALSE)
plot(mca, invisible=c("ind"), habillage = "quali")

I chose a dataset focused on how people consume tea: whether they use tea bags or loose tea, with or without sugar, where the tea is bought, which type they drink, whether they drink it at lunch, and whether they add extra flavours such as lemon or milk.

As you can see from the MCA factor map, and from the bar charts above it, the data can be analysed with MCA: the closer a category is to the origin, the more common it is, and categories that appear close together tend to occur together. Doing the MCA factoring, you can see how the values relate to each other. Here you can see that Earl Grey is not that far from black tea, and tea is also consumed quite often at lunch time. Milk also plays a larger role than I originally expected.
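
To see where those map positions come from, MCA can be sketched from scratch as correspondence analysis of the 0/1 indicator matrix of the factors. The toy data below is made up, and this is only an illustration of the mechanics, not FactoMineR's exact implementation (which reports the corresponding per-dimension percentages via mca$eig):

```r
# Toy two-factor table standing in for the tea columns
toy <- data.frame(sugar = factor(c("sugar", "No.sugar", "sugar", "No.sugar")),
                  how   = factor(c("tea bag", "tea bag", "unpackaged", "unpackaged")))

# Indicator (disjunctive) matrix: one 0/1 column per category
Z <- model.matrix(~ . - 1, data = toy,
                  contrasts.arg = lapply(toy, contrasts, contrasts = FALSE))

# Correspondence analysis of Z: squared singular values of the
# standardized residual matrix are the principal inertias
P   <- Z / sum(Z)
r   <- rowSums(P)
cc  <- colSums(P)
S   <- diag(1 / sqrt(r)) %*% (P - r %o% cc) %*% diag(1 / sqrt(cc))
eig <- svd(S)$d^2

# Percentage of inertia carried by each dimension
round(100 * eig / sum(eig), 1)
```

Categories are then plotted using the singular vectors, which is why rare categories end up far from the origin on the factor map.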